Air pollution refers to the existence of pollutants like dust, fumes, gas, mist, odor, smoke, or vapor in the atmosphere, at levels and for duration that are harmful to human health. The primary route through which these pollutants affect people is via the respiratory system. 1Air pollution originates from a variety of sources, each contributing differently to the overall pollution levels. Mobile sources, including cars, buses, planes, trucks, and trains, are a major contributor, especially in urban areas. Stationary sources, like power plants, oil refineries, industrial facilities, and factories, also play a significant role, often emitting large quantities of pollutants. Area sources, which cover broader regions, include agricultural areas, urban settings, and wood-burning fireplaces, contributing to localized pollution. Finally, natural sources such as wind-blown dust, wildfires, and volcanic activities add to the air pollution mix, though their occurrences and impacts can vary greatly depending on geographical and environmental conditions.2While numerous toxins negatively affect health, the most concerning pollutants for public health, based on strong evidence, are particulate matter (PM), carbon monoxide (CO), ozone (O3), nitrogen dioxide (NO2), and sulphur dioxide (SO2). These substances are particularly notable for their detrimental health impacts.3
Particle pollution, also known as particulate matter (PM), consists of small solid or liquid particles present in the air. This includes substances like dust, dirt, soot, smoke, and liquid droplets. While some of these particles are large or dark enough to be visible, others are too small to be seen with the naked eye. Inhaling this type of pollution can pose health risks. Coarser particles, known as PM10, can cause irritation to the eyes, nose, and throat and are commonly found in dust from various sources such as roads, farms, dry riverbeds, construction sites, and mines. Finer particles, referred to as PM2.5, are particularly hazardous as they can penetrate deep into the lungs and even enter the bloodstream.4 Exposure to particle pollution, whether for short or long duration, can result in various health issues. These include serious conditions such as stroke, chronic obstructive pulmonary disease (COPD), and cancers affecting the trachea, bronchus, and lungs. Additionally, air pollution can worsen asthma symptoms and lead to lower respiratory infections.5 It’s crucial to assess the health impacts of particulate matter (PM) by using methods that can accurately measure variations in the concentration of different PM sizes over time and space. This is particularly important for studies covering large geographic areas, as the levels and effects of each PM size fraction can vary significantly across different locations and periods, influencing health outcomes.6
Epidemiologists rely on a network of air quality monitors spread across the United States to gauge levels of air pollution and its impacts on public health. However, these monitors are not uniformly distributed, leading to sparse data in many areas. This scarcity hampers the granularity of air pollution monitoring, which is critical for detecting the full scope of its health effects and pinpointing at-risk locations. Even within a single city, dramatic variations in pollution rates can be observed, underscoring the concept of “micro-environments”—distinct areas within urban spaces where air quality can differ significantly from one block to the next.7 Consequently, there’s a pressing need to develop new models that can interpolate or extrapolate from limited data points to provide a more detailed and comprehensive view of air pollution and its fine-scale distribution. The Environmental Health journal has featured a study that successfully utilized machine learning techniques to enhance the monitoring of air pollution. This approach leveraged data points such as population and road density to forecast localized air pollution levels.8 Drawing inspiration from this, our goal is to accurately predict the annual average air pollution concentrations within the U.S. at the zip code level, utilizing predictors that encompass population density, urban development, road density, satellite pollution data, and data from chemical atmospheric modeling.
library(OCSdata)
library(here)
library(skimr)
library(dplyr)
library(corrplot)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
library(tidyverse)
library(sf)
library(maps)
library(rnaturalearth)
library(caret)
library(recipes)
library(parsnip)
library(vip)
library(tune)
library(patchwork)
library(readxl)
library(car)
Can we predict US annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as, satellite pollution data and chemical modeling data?
The dataset we are using for our analysis contains information from gravimetric monitors operated by the US Environmental Protection Agency (EPA).9 These monitors measure PM2.5 (fine particulate matter) concentrations in the air, specifically the average values for the year 2008. The dataset includes various features, such as monitor ID, latitude, longitude, state, county, city, CMAQ (air pollution estimates from a computational model), impervious surface measures, county area and population, distances to roads, emissions data, population density, educational attainment percentages, poverty rates, urban-rural classifications, and aerosol optical depth measurements from NASA satellites. There are a total of 48 features for each of the 876 monitors in the dataset. This data comes from multiple sources, including the EPA10, NASA11, the US Census12, and the National Center for Health Statistics13, and it provides valuable insights into air pollution and its potential drivers across the United States. Data is collected and organized by researcher Roger D. Peng in John Hopkins Univerisity. 14
#OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())
pm <- readr::read_csv(here::here("OCS_data", "data","raw", "pm25_data.csv"))
## Rows: 876 Columns: 50
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, county, city
## dbl (47): id, value, fips, lat, lon, CMAQ, zcta, zcta_area, zcta_pop, imp_a5...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(pm)
| Name | pm |
| Number of rows | 876 |
| Number of columns | 50 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 47 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 49 | 0 |
| county | 0 | 1 | 3 | 20 | 0 | 471 | 0 |
| city | 0 | 1 | 4 | 48 | 0 | 607 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 26987.96 | 1.578761e+04 | 1003.00 | 13089.15 | 26132.00 | 39118.00 | 5.603910e+04 | ▇▇▆▇▆ |
| value | 0 | 1 | 10.81 | 2.580000e+00 | 3.02 | 9.27 | 11.15 | 12.37 | 2.316000e+01 | ▂▆▇▁▁ |
| fips | 0 | 1 | 26987.89 | 1.578763e+04 | 1003.00 | 13089.00 | 26132.00 | 39118.00 | 5.603900e+04 | ▇▇▆▇▆ |
| lat | 0 | 1 | 38.48 | 4.620000e+00 | 25.47 | 35.03 | 39.30 | 41.66 | 4.840000e+01 | ▁▃▅▇▂ |
| lon | 0 | 1 | -91.74 | 1.496000e+01 | -124.18 | -99.16 | -87.47 | -80.69 | -6.804000e+01 | ▃▂▃▇▃ |
| CMAQ | 0 | 1 | 8.41 | 2.970000e+00 | 1.63 | 6.53 | 8.62 | 10.24 | 2.313000e+01 | ▃▇▃▁▁ |
| zcta | 0 | 1 | 50890.29 | 2.778447e+04 | 1022.00 | 28788.25 | 48172.00 | 74371.00 | 9.920200e+04 | ▅▇▇▅▇ |
| zcta_area | 0 | 1 | 183173481.91 | 5.425989e+08 | 15459.00 | 14204601.75 | 37653560.50 | 160041508.25 | 8.164821e+09 | ▇▁▁▁▁ |
| zcta_pop | 0 | 1 | 24227.58 | 1.777216e+04 | 0.00 | 9797.00 | 22014.00 | 35004.75 | 9.539700e+04 | ▇▇▃▁▁ |
| imp_a500 | 0 | 1 | 24.72 | 1.934000e+01 | 0.00 | 3.70 | 25.12 | 40.22 | 6.961000e+01 | ▇▅▆▃▂ |
| imp_a1000 | 0 | 1 | 24.26 | 1.802000e+01 | 0.00 | 5.32 | 24.53 | 38.59 | 6.750000e+01 | ▇▅▆▃▁ |
| imp_a5000 | 0 | 1 | 19.93 | 1.472000e+01 | 0.05 | 6.79 | 19.07 | 30.11 | 7.460000e+01 | ▇▆▃▁▁ |
| imp_a10000 | 0 | 1 | 15.82 | 1.381000e+01 | 0.09 | 4.54 | 12.36 | 24.17 | 7.209000e+01 | ▇▃▂▁▁ |
| imp_a15000 | 0 | 1 | 13.43 | 1.312000e+01 | 0.11 | 3.24 | 9.67 | 20.55 | 7.110000e+01 | ▇▃▁▁▁ |
| county_area | 0 | 1 | 3768701992.12 | 6.212830e+09 | 33703512.00 | 1116536297.50 | 1690826566.50 | 2878192209.00 | 5.194723e+10 | ▇▁▁▁▁ |
| county_pop | 0 | 1 | 687298.44 | 1.293489e+06 | 783.00 | 100948.00 | 280730.50 | 743159.00 | 9.818605e+06 | ▇▁▁▁▁ |
| log_dist_to_prisec | 0 | 1 | 6.19 | 1.410000e+00 | -1.46 | 5.43 | 6.36 | 7.15 | 1.045000e+01 | ▁▁▃▇▁ |
| log_pri_length_5000 | 0 | 1 | 9.82 | 1.080000e+00 | 8.52 | 8.52 | 10.05 | 10.73 | 1.205000e+01 | ▇▂▆▅▂ |
| log_pri_length_10000 | 0 | 1 | 10.92 | 1.130000e+00 | 9.21 | 9.80 | 11.17 | 11.83 | 1.302000e+01 | ▇▂▇▇▃ |
| log_pri_length_15000 | 0 | 1 | 11.50 | 1.150000e+00 | 9.62 | 10.87 | 11.72 | 12.40 | 1.359000e+01 | ▆▂▇▇▃ |
| log_pri_length_25000 | 0 | 1 | 12.24 | 1.100000e+00 | 10.13 | 11.69 | 12.46 | 13.05 | 1.436000e+01 | ▅▃▇▇▃ |
| log_prisec_length_500 | 0 | 1 | 6.99 | 9.500000e-01 | 6.21 | 6.21 | 6.21 | 7.82 | 9.400000e+00 | ▇▁▂▂▁ |
| log_prisec_length_1000 | 0 | 1 | 8.56 | 7.900000e-01 | 7.60 | 7.60 | 8.66 | 9.20 | 1.047000e+01 | ▇▅▆▃▁ |
| log_prisec_length_5000 | 0 | 1 | 11.28 | 7.800000e-01 | 8.52 | 10.91 | 11.42 | 11.83 | 1.278000e+01 | ▁▁▃▇▃ |
| log_prisec_length_10000 | 0 | 1 | 12.41 | 7.300000e-01 | 9.21 | 11.99 | 12.53 | 12.94 | 1.385000e+01 | ▁▁▃▇▅ |
| log_prisec_length_15000 | 0 | 1 | 13.03 | 7.200000e-01 | 9.62 | 12.59 | 13.13 | 13.57 | 1.441000e+01 | ▁▁▃▇▅ |
| log_prisec_length_25000 | 0 | 1 | 13.82 | 7.000000e-01 | 10.13 | 13.38 | 13.92 | 14.35 | 1.523000e+01 | ▁▁▃▇▆ |
| log_nei_2008_pm25_sum_10000 | 0 | 1 | 3.97 | 2.350000e+00 | 0.00 | 2.15 | 4.29 | 5.69 | 9.120000e+00 | ▆▅▇▆▂ |
| log_nei_2008_pm25_sum_15000 | 0 | 1 | 4.72 | 2.250000e+00 | 0.00 | 3.47 | 5.00 | 6.35 | 9.420000e+00 | ▃▃▇▇▂ |
| log_nei_2008_pm25_sum_25000 | 0 | 1 | 5.67 | 2.110000e+00 | 0.00 | 4.66 | 5.91 | 7.28 | 9.650000e+00 | ▂▂▇▇▃ |
| log_nei_2008_pm10_sum_10000 | 0 | 1 | 4.35 | 2.320000e+00 | 0.00 | 2.69 | 4.62 | 6.07 | 9.340000e+00 | ▅▅▇▇▂ |
| log_nei_2008_pm10_sum_15000 | 0 | 1 | 5.10 | 2.180000e+00 | 0.00 | 3.87 | 5.39 | 6.72 | 9.710000e+00 | ▂▃▇▇▂ |
| log_nei_2008_pm10_sum_25000 | 0 | 1 | 6.07 | 2.010000e+00 | 0.00 | 5.10 | 6.37 | 7.52 | 9.880000e+00 | ▁▂▆▇▃ |
| popdens_county | 0 | 1 | 551.76 | 1.711510e+03 | 0.26 | 40.77 | 156.67 | 510.81 | 2.682191e+04 | ▇▁▁▁▁ |
| popdens_zcta | 0 | 1 | 1279.66 | 2.757490e+03 | 0.00 | 101.15 | 610.35 | 1382.52 | 3.041884e+04 | ▇▁▁▁▁ |
| nohs | 0 | 1 | 6.99 | 7.210000e+00 | 0.00 | 2.70 | 5.10 | 8.80 | 1.000000e+02 | ▇▁▁▁▁ |
| somehs | 0 | 1 | 10.17 | 6.200000e+00 | 0.00 | 5.90 | 9.40 | 13.90 | 7.220000e+01 | ▇▂▁▁▁ |
| hs | 0 | 1 | 30.32 | 1.140000e+01 | 0.00 | 23.80 | 30.75 | 36.10 | 1.000000e+02 | ▂▇▂▁▁ |
| somecollege | 0 | 1 | 21.58 | 8.600000e+00 | 0.00 | 17.50 | 21.30 | 24.70 | 1.000000e+02 | ▆▇▁▁▁ |
| associate | 0 | 1 | 7.13 | 4.010000e+00 | 0.00 | 4.90 | 7.10 | 8.80 | 7.140000e+01 | ▇▁▁▁▁ |
| bachelor | 0 | 1 | 14.90 | 9.710000e+00 | 0.00 | 8.80 | 12.95 | 19.22 | 1.000000e+02 | ▇▂▁▁▁ |
| grad | 0 | 1 | 8.91 | 8.650000e+00 | 0.00 | 3.90 | 6.70 | 11.00 | 1.000000e+02 | ▇▁▁▁▁ |
| pov | 0 | 1 | 14.95 | 1.133000e+01 | 0.00 | 6.50 | 12.10 | 21.22 | 6.590000e+01 | ▇▅▂▁▁ |
| hs_orless | 0 | 1 | 47.48 | 1.675000e+01 | 0.00 | 37.92 | 48.65 | 59.10 | 1.000000e+02 | ▁▃▇▃▁ |
| urc2013 | 0 | 1 | 2.92 | 1.520000e+00 | 1.00 | 2.00 | 3.00 | 4.00 | 6.000000e+00 | ▇▅▃▂▁ |
| urc2006 | 0 | 1 | 2.97 | 1.520000e+00 | 1.00 | 2.00 | 3.00 | 4.00 | 6.000000e+00 | ▇▅▃▂▁ |
| aod | 0 | 1 | 43.70 | 1.956000e+01 | 5.00 | 31.66 | 40.17 | 49.67 | 1.430000e+02 | ▃▇▁▁▁ |
First, let’s use the skimr package to get a preliminary look at our data. Our data has 876 rows and 50 columns. n_missing represents missing data. From this we can see that there is no missing data in this dataset.
There are two types of column types: character and numeric. Character columns include state, county and city variables. We can see there are 49 states in the dataset. However, there are 50 states in the U.S. We can take a look at which states are included.
pm|>
distinct(state)
## # A tibble: 49 × 1
## state
## <chr>
## 1 Alabama
## 2 Arizona
## 3 Arkansas
## 4 California
## 5 Colorado
## 6 Connecticut
## 7 Delaware
## 8 District Of Columbia
## 9 Florida
## 10 Georgia
## # ℹ 39 more rows
In the dataset, District of Columbia is counted as a seperate state. Hawaii and Alaska data is not included in the dataset.
Id (Monitor number), fips (Federal information processing standard number for the county where the monitor is located), zcta (Zip Code Tabulation Area where the monitor is located) are handled as numeric here. These variables consist of numbers but there is no specific orders between these numbers. Therefore, they should be handled as factors instead of numeric. We mutate these variables and change all of them into factors using as.factor function.
pm <- pm |>
mutate(across(c(id, fips, zcta), as.factor))
skimr::skim(pm)
| Name | pm |
| Number of rows | 876 |
| Number of columns | 50 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 3 |
| numeric | 44 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 49 | 0 |
| county | 0 | 1 | 3 | 20 | 0 | 471 | 0 |
| city | 0 | 1 | 4 | 48 | 0 | 607 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| id | 0 | 1 | FALSE | 876 | 100: 1, 102: 1, 103: 1, 104: 1 |
| fips | 0 | 1 | FALSE | 569 | 170: 12, 603: 10, 261: 9, 107: 8 |
| zcta | 0 | 1 | FALSE | 842 | 475: 3, 110: 2, 160: 2, 290: 2 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| value | 0 | 1 | 10.81 | 2.580000e+00 | 3.02 | 9.27 | 11.15 | 12.37 | 2.316000e+01 | ▂▆▇▁▁ |
| lat | 0 | 1 | 38.48 | 4.620000e+00 | 25.47 | 35.03 | 39.30 | 41.66 | 4.840000e+01 | ▁▃▅▇▂ |
| lon | 0 | 1 | -91.74 | 1.496000e+01 | -124.18 | -99.16 | -87.47 | -80.69 | -6.804000e+01 | ▃▂▃▇▃ |
| CMAQ | 0 | 1 | 8.41 | 2.970000e+00 | 1.63 | 6.53 | 8.62 | 10.24 | 2.313000e+01 | ▃▇▃▁▁ |
| zcta_area | 0 | 1 | 183173481.91 | 5.425989e+08 | 15459.00 | 14204601.75 | 37653560.50 | 160041508.25 | 8.164821e+09 | ▇▁▁▁▁ |
| zcta_pop | 0 | 1 | 24227.58 | 1.777216e+04 | 0.00 | 9797.00 | 22014.00 | 35004.75 | 9.539700e+04 | ▇▇▃▁▁ |
| imp_a500 | 0 | 1 | 24.72 | 1.934000e+01 | 0.00 | 3.70 | 25.12 | 40.22 | 6.961000e+01 | ▇▅▆▃▂ |
| imp_a1000 | 0 | 1 | 24.26 | 1.802000e+01 | 0.00 | 5.32 | 24.53 | 38.59 | 6.750000e+01 | ▇▅▆▃▁ |
| imp_a5000 | 0 | 1 | 19.93 | 1.472000e+01 | 0.05 | 6.79 | 19.07 | 30.11 | 7.460000e+01 | ▇▆▃▁▁ |
| imp_a10000 | 0 | 1 | 15.82 | 1.381000e+01 | 0.09 | 4.54 | 12.36 | 24.17 | 7.209000e+01 | ▇▃▂▁▁ |
| imp_a15000 | 0 | 1 | 13.43 | 1.312000e+01 | 0.11 | 3.24 | 9.67 | 20.55 | 7.110000e+01 | ▇▃▁▁▁ |
| county_area | 0 | 1 | 3768701992.12 | 6.212830e+09 | 33703512.00 | 1116536297.50 | 1690826566.50 | 2878192209.00 | 5.194723e+10 | ▇▁▁▁▁ |
| county_pop | 0 | 1 | 687298.44 | 1.293489e+06 | 783.00 | 100948.00 | 280730.50 | 743159.00 | 9.818605e+06 | ▇▁▁▁▁ |
| log_dist_to_prisec | 0 | 1 | 6.19 | 1.410000e+00 | -1.46 | 5.43 | 6.36 | 7.15 | 1.045000e+01 | ▁▁▃▇▁ |
| log_pri_length_5000 | 0 | 1 | 9.82 | 1.080000e+00 | 8.52 | 8.52 | 10.05 | 10.73 | 1.205000e+01 | ▇▂▆▅▂ |
| log_pri_length_10000 | 0 | 1 | 10.92 | 1.130000e+00 | 9.21 | 9.80 | 11.17 | 11.83 | 1.302000e+01 | ▇▂▇▇▃ |
| log_pri_length_15000 | 0 | 1 | 11.50 | 1.150000e+00 | 9.62 | 10.87 | 11.72 | 12.40 | 1.359000e+01 | ▆▂▇▇▃ |
| log_pri_length_25000 | 0 | 1 | 12.24 | 1.100000e+00 | 10.13 | 11.69 | 12.46 | 13.05 | 1.436000e+01 | ▅▃▇▇▃ |
| log_prisec_length_500 | 0 | 1 | 6.99 | 9.500000e-01 | 6.21 | 6.21 | 6.21 | 7.82 | 9.400000e+00 | ▇▁▂▂▁ |
| log_prisec_length_1000 | 0 | 1 | 8.56 | 7.900000e-01 | 7.60 | 7.60 | 8.66 | 9.20 | 1.047000e+01 | ▇▅▆▃▁ |
| log_prisec_length_5000 | 0 | 1 | 11.28 | 7.800000e-01 | 8.52 | 10.91 | 11.42 | 11.83 | 1.278000e+01 | ▁▁▃▇▃ |
| log_prisec_length_10000 | 0 | 1 | 12.41 | 7.300000e-01 | 9.21 | 11.99 | 12.53 | 12.94 | 1.385000e+01 | ▁▁▃▇▅ |
| log_prisec_length_15000 | 0 | 1 | 13.03 | 7.200000e-01 | 9.62 | 12.59 | 13.13 | 13.57 | 1.441000e+01 | ▁▁▃▇▅ |
| log_prisec_length_25000 | 0 | 1 | 13.82 | 7.000000e-01 | 10.13 | 13.38 | 13.92 | 14.35 | 1.523000e+01 | ▁▁▃▇▆ |
| log_nei_2008_pm25_sum_10000 | 0 | 1 | 3.97 | 2.350000e+00 | 0.00 | 2.15 | 4.29 | 5.69 | 9.120000e+00 | ▆▅▇▆▂ |
| log_nei_2008_pm25_sum_15000 | 0 | 1 | 4.72 | 2.250000e+00 | 0.00 | 3.47 | 5.00 | 6.35 | 9.420000e+00 | ▃▃▇▇▂ |
| log_nei_2008_pm25_sum_25000 | 0 | 1 | 5.67 | 2.110000e+00 | 0.00 | 4.66 | 5.91 | 7.28 | 9.650000e+00 | ▂▂▇▇▃ |
| log_nei_2008_pm10_sum_10000 | 0 | 1 | 4.35 | 2.320000e+00 | 0.00 | 2.69 | 4.62 | 6.07 | 9.340000e+00 | ▅▅▇▇▂ |
| log_nei_2008_pm10_sum_15000 | 0 | 1 | 5.10 | 2.180000e+00 | 0.00 | 3.87 | 5.39 | 6.72 | 9.710000e+00 | ▂▃▇▇▂ |
| log_nei_2008_pm10_sum_25000 | 0 | 1 | 6.07 | 2.010000e+00 | 0.00 | 5.10 | 6.37 | 7.52 | 9.880000e+00 | ▁▂▆▇▃ |
| popdens_county | 0 | 1 | 551.76 | 1.711510e+03 | 0.26 | 40.77 | 156.67 | 510.81 | 2.682191e+04 | ▇▁▁▁▁ |
| popdens_zcta | 0 | 1 | 1279.66 | 2.757490e+03 | 0.00 | 101.15 | 610.35 | 1382.52 | 3.041884e+04 | ▇▁▁▁▁ |
| nohs | 0 | 1 | 6.99 | 7.210000e+00 | 0.00 | 2.70 | 5.10 | 8.80 | 1.000000e+02 | ▇▁▁▁▁ |
| somehs | 0 | 1 | 10.17 | 6.200000e+00 | 0.00 | 5.90 | 9.40 | 13.90 | 7.220000e+01 | ▇▂▁▁▁ |
| hs | 0 | 1 | 30.32 | 1.140000e+01 | 0.00 | 23.80 | 30.75 | 36.10 | 1.000000e+02 | ▂▇▂▁▁ |
| somecollege | 0 | 1 | 21.58 | 8.600000e+00 | 0.00 | 17.50 | 21.30 | 24.70 | 1.000000e+02 | ▆▇▁▁▁ |
| associate | 0 | 1 | 7.13 | 4.010000e+00 | 0.00 | 4.90 | 7.10 | 8.80 | 7.140000e+01 | ▇▁▁▁▁ |
| bachelor | 0 | 1 | 14.90 | 9.710000e+00 | 0.00 | 8.80 | 12.95 | 19.22 | 1.000000e+02 | ▇▂▁▁▁ |
| grad | 0 | 1 | 8.91 | 8.650000e+00 | 0.00 | 3.90 | 6.70 | 11.00 | 1.000000e+02 | ▇▁▁▁▁ |
| pov | 0 | 1 | 14.95 | 1.133000e+01 | 0.00 | 6.50 | 12.10 | 21.22 | 6.590000e+01 | ▇▅▂▁▁ |
| hs_orless | 0 | 1 | 47.48 | 1.675000e+01 | 0.00 | 37.92 | 48.65 | 59.10 | 1.000000e+02 | ▁▃▇▃▁ |
| urc2013 | 0 | 1 | 2.92 | 1.520000e+00 | 1.00 | 2.00 | 3.00 | 4.00 | 6.000000e+00 | ▇▅▃▂▁ |
| urc2006 | 0 | 1 | 2.97 | 1.520000e+00 | 1.00 | 2.00 | 3.00 | 4.00 | 6.000000e+00 | ▇▅▃▂▁ |
| aod | 0 | 1 | 43.70 | 1.956000e+01 | 5.00 | 31.66 | 40.17 | 49.67 | 1.430000e+02 | ▃▇▁▁▁ |
Before diving deep into the data, it is important to see the distributions of the monitors. I first count the occurence of each state. Then I ordered them from state that has the least amount of monitors to the state that has the most.
pm |>
count(state) |>
mutate(state = reorder(state, n)) |>
ggplot(aes(x = state, y = n)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Distribution of monitors is uneven",
subtitle = "Number of Monitors in Each State",
x = "State",
y = "Number of Monitors")
California stands out with the highest count, nearing 90 monitors, suggesting a significant emphasis on air quality monitoring likely due to its large population and environmental concerns. On the other end of the spectrum, states such as Maine, the District of Columbia, and North Dakota have the fewest monitors, possibly reflecting lower population densities, smaller area or different prioritization in environmental monitoring. Other states like Ohio, Texas, and Pennsylvania exhibit a higher concentration of monitors, each with counts ranging from 40 to 60, which may point to their considerable industrial activities and associated air quality management efforts. The rest of the states display moderate numbers, with most having between 10 to 30 monitors. The distribution reflects a diversity of monitoring intensities, which can be attributed to various factors including state size, population density, industrial presence, and environmental policy initiatives.
When we employ machine learning algorithm, we want to avoid including redundant variables as they slow down the training process. To better understand the dataset and their correlation, we create a correlation heat map to see the relationship between each variable. In a correlation heatmap, each cell shows the correlation coefficient between two variables.The color and size of each circle represent the magnitude and direction of the correlation (red for positive, blue for negative). A correlation coefficient close to 1 or -1 indicates a strong positive or negative correlation, respectively, while a coefficient around 0 suggests no linear relationship.
PM_cor <- cor(pm %>% dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)
The diagonal line of dark blue squares represents the perfect positive correlation of each variable with itself (correlation coefficient of 1). Take out the diagonal of the graph, we can actually see some very useful information that might help us to determine the correlation we would like to investigate.
For example, the variables related to the impervious surface measures at different radius (imp_a500, imp_a1000, etc.) seem to be strongly positively correlated with each other, which is expected since these measures are likely to be similar within the same geographical locations. Variables representing the logged lengths of primary roads within various radius around monitoring stations (5000, 10000, 15000, and 25000 meters) are also highly correlated. This is expected since these metrics are different scopes of the road density.
Similar to the primary road length variables, the logged counts of primary and secondary road lengths within different radius (5000, 10000, 15000, and 25000 meters) around the monitoring stations are also showing a strong positive correlation. These variables quantify the combined lengths of both major and minor roads around each monitor and are expected to be correlated because they all expand upon the same underlying feature. The positive correlation observed between log_prisec_length_500 and log_prisec_length_1000 suggests a close relationship in the road network characteristics within a 1-kilometer radius from the monitors, indicating that these variables capture a similar scope of the urban infrastructure at a localized level. In contrast, their lack of strong correlation with the road length measures at larger radius implies a divergence in road network density.
The strong positive correlations among the log_nei_2008_pm25_sum and log_nei_2008_pm10_sum variables across the 10,000, 15,000, and 25,000-meter radius measurements suggest that emission sources for PM2.5 and PM10 have a consistent distribution across these areas. As these variables represent the logged sums of emissions from significant sources within the specified radius around the monitors, the findings imply a relative uniformity in the spatial spread of emission sources over these distances.
The strong positive correlation between popdens_county and popdens_zcta indicates that the population density within the counties and the corresponding Zip Code Tabulation Areas (ZCTAs) tend to increase and decrease together. This relationship suggests that areas with high population density at the county level also exhibit high density at the more localized ZCTA level, and vice versa.
The high positive correlation between urc2013 and urc2006 — both urban-rural classification systems for counties — suggests that the urban or rural character of most counties remained relatively stable between 2006 and 2013. However, their strong negative correlation with other variables like population density, road lengths, and emissions data implies that as areas become more urban (lower values in urc2013 and urc2006), factors such as population density, road infrastructure, and emissions tend to increase and vice versa. This negative correlation aligns with general urban planning and environmental patterns, where urban areas are typically characterized by higher population densities, more extensive road networks, and potentially greater emissions due to higher industrial and vehicular activity, compared to rural areas.
Now that we have found some variables strongly correlated with each other, we decide to draw a scatterplot matrix of these variables. This scatter plot matrix provides a pairwise comparison of the variables. On the diagonal are density plots for each variable, which show the distribution of each individual variable’s values. On the lower triangle are scatterplots for each pair of variables. On the upper triangle are the correlation coefficients for each pair of variables. We would do it on impervious surface measure variables first. Therefore, we select all variables that contains “imp” and use GGally package to make the scatterplot matrix.
select(pm, contains("imp")) %>%
GGally::ggpairs()
From the plot we can see that all variables are strongly positively correlated as their correlation values are all positive and greater than 0.5. Among these values, variables imp_a1000 and imp_a500 have a particularly high correlation, with a correlation value of 0.973. Similarly, imp_a10000 and imp_a15000 also show a very strong correlation, with a correlation value of 0.968.
Next, we would like to see the scatterplot matrix of emissions variables. Similarly, we use select function to select all the data that contain “nei” and draw the matrix. However, the existing emissions variable names are too long for the small plot. Hence, we decide to replace the existing names with new, shorter variable names, simplifying them by omitting elements like “log”, “2008”, and “sum” from the name. For example, the new name PM2.5_10k, derived from the original log_nei_2008_pm25_sum_10000, represents the logged sum of PM2.5 emissions within a 10,000-meter radius from the monitoring stations, based on 2008 National Emissions Inventory data. We map these new labels to their corresponding original variable names. we use the select function from to create a new dataframe, pm_selected, which contains only the columns specified in long_names and then generate the ggpairs plot using the new labels. Additionally, we adjust the title text size in the plot to ensure clarity and readability.
# Existing long variable names
long_names <- c("log_nei_2008_pm25_sum_10000",
"log_nei_2008_pm25_sum_15000",
"log_nei_2008_pm25_sum_25000",
"log_nei_2008_pm10_sum_10000",
"log_nei_2008_pm10_sum_15000",
"log_nei_2008_pm10_sum_25000")
# Shorter variable names
new_labels <- c("PM2.5_10k", "PM2.5_15k", "PM2.5_25k", "PM10_10k", "PM10_15k", "PM10_25k")
# Map the new labels to the existing names
names(new_labels) <- long_names
# select data that contains the long names.
pm_selected <- pm %>% select(all_of(long_names))
# Plot
ggpairs_plot <- GGally::ggpairs(pm_selected, columnLabels = new_labels)
# make the title text smaller
ggpairs_plot +
theme(strip.text = element_text(size = 8)) # Adjust the size of the title
The variables related to emissions (nei) demonstrate strong positive correlations with one another, as indicated by high correlation coefficients (most above 0.7). Variables representing PM2.5 and PM10 emissions within the same radius have the highest correlation values. This is expected since PM2.5 particles are a subset of PM10, meaning all PM2.5 emissions are also counted as PM10. Therefore, within the same radius, their emission quantities are intrinsically linked, leading to a strong correlation between these variables.
Finally, we plot the correlation matrix plot of variables including log distance to a primary or secondary road from the monitor, log primary road length at different radius from the monitors(5000 meters, 10000 meters, 15000 meters and 25000 meters) and log primary and secondary road length at different radius from the monitors (500 meters, 1000 meters, 5000 meters, 10000 meters, 15000 meters and 25000 meters). We use the select function to select all columns from the pm dataset that include “pri”. We then customize the color filling to keep it consistent with the previous plot.
select(pm, contains("pri")) %>%
GGally::ggcorr(hjust = .85, size = 3,
layout.exp=2, label = TRUE) +
scale_fill_gradient2(low = "#f43414", high = "#469db4", mid = "white", midpoint = 0, limit = c(-1, 1), space = "Lab", name="Correlation")
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
Distance to Primary/Secondary Roads (Log_dist_to_prisec) is negatively correlated with all other variables, meaning as the distance to primary or secondary roads increases, the length of roads within a specified radius tends to decrease. This is intuitive as monitors farther from roads will have fewer roads within any given radius. The log-transformed counts of primary road lengths, measured within different radius around a monitor, are positively correlated with each other, indicating that areas with more primary roads within a smaller radius tend to have more within a larger radius as well. This pattern is similar for the combined lengths of primary and secondary roads within these radius. As the radius increases, the lengths of primary roads are strongly correlated with the lengths of both primary and secondary roads, suggesting that in places with many primary roads, you’re likely to find many secondary roads as well.
Next, we are going to make a random forest model to predict annual average air pollution concentrations in the U.S. at the zip code level based features in the dataset. Random forest is a machine learning technique that works by combining the predictions from multiple decision trees to make more accurate and reliable predictions. Each decision tree is like a “vote,” partitioning data into different subsets. By aggregating these votes, random forest gives us a more robust result. It’s especially useful when dealing with complex data and can handle both numerical and categorical variables.
First, we are changing the city category to only have two categories: “Not in a city” and “In a city” using the case_when function. This is because there are more than 600 different cities and random forest package can not deal with variables that have more than 53 categories. This way, we can see if being in a city or not being in a city can affect air pollution. In addition, we are going to randomly split the data into training and testing in order to evaluate the performance of the model. We are going to use two thirds of the data to train the model and the last third to test the model.
pm <- pm |>
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
set.seed(1234)
pm_split <- rsample::initial_split(data = pm, prop = 2/3)
pm_split
## <Training/Testing/Total>
## <584/292/876>
We split the dataset into two subsets: a training set and a testing set, represented by train_pm and test_pm, respectively. In machine learning, we typically divide our data into two sets: a training dataset and a testing dataset. The training dataset is used to train the model, allowing it to learn patterns and relationships within the data. The testing dataset is then used to evaluate the model’s performance and assess its ability to make accurate predictions on new, unseen data. This separation helps us determine how well the model generalizes from the training data to make predictions on real-world data it has never encountered before
To get an overview of the distribution of observations across different states in the training set, we used the count function on the train_pm dataset.
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
The proportion of states in the training dataset is representative of the overall monitor distribution. This ensures that the model is exposed to a diverse range of geographic locations. It contains a significant number of monitors from California, with 55 observations. This likely reflects the fact that California is a large and populous state with various monitoring locations.
we’re utilizing a piece of code that plays a crucial role in preparing our data for analysis. This code is based on the concept of a “recipe,” which in data science terms refers to a series of steps for processing data. It’s similar to a cooking recipe, where we follow specific instructions to prepare a dish, but here, our ingredients are data, and our final dish is a well-prepared dataset ready for modeling.
We begin by creating a recipe object, RF_rec, using our training dataset train_pm. We start by setting every column in our dataset as a “predictor” using update_role(everything(), new_role = “predictor”), meaning we initially consider all columns as potential input variables for our model. However, we then specify that the value column is our “outcome” (the target variable we want to predict), and we assign special roles to id and fips as an “id variable” and “county id,” respectively because the numbers of id and county do not provide intrinsic information for our predictive modeling; they are identifiers rather than predictors . The step_novel(“state”) function is essential for addressing any new state values that may appear in future datasets. It’s also important for cross-validation processes (which will be specified later), where different levels in each fold’s test and training sets need to be accommodated seamlessly. We also convert the “state,” “county,” and “city” variables from character to factor type. We are removing county, the zip code tabulation area (zcta) because these two categorical variables have more than 53 levels, surpassing the compatibility limit of our package. The final steps, step_corr(all_numeric()) and step_nzv(all_numeric()), focus on enhancing the predictor variables. The function step_corr eliminates highly correlated numeric predictors, as these can bring noise to our model and slow down training process. We also use step_nzv to remove numeric predictors with near-zero variance, which are unlikely to be significant for our predictive model.
RF_rec <- recipe(train_pm) %>%
update_role(everything(), new_role = "predictor")%>%
update_role(value, new_role = "outcome")%>%
update_role(id, new_role = "id variable") %>%
update_role("fips", new_role = "county id") %>%
step_novel("state") %>%
step_string2factor("state", "county", "city") %>%
step_rm("county") %>%
step_rm("zcta") %>%
step_corr(all_numeric())%>%
step_nzv(all_numeric())
We then initiated a Random Forest model using rand_forest() with two key parameters: mtry and min_n. In Random Forest, mtry represents the number of variables considered at each split in the decision tree, while min_n is the minimum number of data points in a tree’s node required to attempt a split. By setting these as tune(), we indicate that these parameters should be optimized through a tuning process rather than being pre-set. This means that rather than sticking with default or guessed values, we actively seek out the best possible values for mtry and min_n based on our specific dataset and modeling goals. We then specify the type of model is Random Forest algorithm and set its mode to regression.
tune_RF_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
set_engine("randomForest") %>%
set_mode("regression")
tune_RF_model
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
The RF_tune_wflow object is a workflow that combines our data preparation recipe and the Random Forest model for tuning. We create it using the workflow() function from the workflows package in R. First, we add our data preparation recipe (RF_rec) to the workflow with add_recipe(RF_rec). This recipe includes all our data preprocessing steps.
Next, we add our Random Forest model (tune_RF_model) to the workflow with add_model(tune_RF_model). This model is set up for tuning, and it’s specified for regression tasks.
The final RF_tune_wflow object is our complete setup for data processing and model tuning in one streamlined workflow. It ensures that our data is correctly preprocessed before being used in the Random Forest model for tuning.
RF_tune_wflow <- workflows::workflow() %>%
workflows::add_recipe(RF_rec) %>%
workflows::add_model(tune_RF_model)
RF_tune_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
##
## • step_novel()
## • step_string2factor()
## • step_rm()
## • step_rm()
## • step_corr()
## • step_nzv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: randomForest
We then use n_cores variable to identify the number of cores available in our computer’s processor. We determine this using the parallel::detectCores() function from the parallel package in R.
This function checks and returns the total number of available CPU cores in our machine. By storing this number in the n_cores variable, we can use this information to optimize our computational tasks.
# making the tuning faster
n_cores <- parallel::detectCores()
n_cores
## [1] 8
In this part of our project, we’re setting up parallel processing and initiating a model tuning process.
First, we use doParallel::registerDoParallel(cores = n_cores) to enable parallel processing in R. By passing n_cores as an argument, we instruct R to use all 8 cores to perform tasks in parallel, which can significantly speed up cross-validation and grid search in model tuning.
Next, we set a random seed with set.seed(123). This is important for reproducibility, ensuring that any random processes within our analysis (like data splitting) can be repeated exactly in future runs.
We then create cross-validation folds for our training data using rsample::vfold_cv(data = train_pm, v = 4). This function from the rsample package divides our train_pm dataset into four folds (as indicated by v = 4) for cross-validation. Cross-validation is a technique used to assess the model’s performance and is particularly useful to avoid overfitting. By setting v = 4, we ensure that our model’s performance is tested on four different subsets of our data.
Finally, we perform the actual model tuning using tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20). This function tunes the RF_tune_wflow workflow (which contains our Random Forest model and preprocessing steps) across different combinations of parameters. The grid = 20 argument specifies that 20 different combinations of mtry and min_n values (the parameters we set to tune) will be tested. The resamples = vfold_pm argument indicates that the tuning will be performed using the cross-validation folds we created earlier.
doParallel::registerDoParallel(cores = n_cores)
set.seed(123)
vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
## i Creating pre-processing data to finalize unknown parameter: mtry
We then extract the best model configuration from our tuning results based on a specific performance metric. In this case, we’re focusing on the Root Mean Square Error. The parameter metric = “rmse” specifies that we want to evaluate the models based on their RMSE. RMSE is a standard measure in regression tasks that quantifies the average difference between the predicted values and the actual values. A lower RMSE indicates a better fit of the model to the data. The parameter n = 1 indicates that we want to see the top 1 result.
show_best(tune_RF_results, metric = "rmse", n = 1)
## # A tibble: 1 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 25 5 rmse standard 1.72 4 0.150 Preprocessor1_Model07
After tuning and using v-fold cross validation, the model with the lowest RMSE has 25 features that will be sampled at each split and a minimum of 5 data points in a node before it continues to split. We identify the optimal parameter values for our Random Forest model based on the tuning results.
tuned_RF_values<- select_best(tune_RF_results, "rmse")
We apply the best-tuned parameters, stored in tuned_RF_values, to our existing workflow RF_tune_wflow. We then fit the model on the training set and evaluates it on the validation set, providing us with the final performance metrics. Here, ‘overallfit’ is generated by fitting RF_tuned_wflow to the training data and evaluating it on the testing data, as specified in pm_split. Finally, we extract the most important features using vip::vip and collect performance metrics using collect_metrics, which gives us a comprehensive view of our model’s effectiveness, including its accuracy and the significance of different predictors in the model. We set num_features to be 3 since we want the top 3 most important features.
RF_tuned_wflow <-RF_tune_wflow %>%
tune::finalize_workflow(tuned_RF_values)
overallfit <- RF_tuned_wflow %>%
tune::last_fit(pm_split)
overallfit %>%
extract_fit_parsnip() %>%
vip::vip(num_features = 3)
collect_metrics(overallfit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.81 Preprocessor1_Model1
## 2 rsq standard 0.544 Preprocessor1_Model1
After tuning the model and seeing how it performs on the test data, it has a similar RMSE to the training data. The test RMSE being 1.81 and the training RMSE being 1.71. This means that the model is not overfit to the training data and is good for generalizing. In addition, the top three features that contribute to annual air pollution are the state that the monitor is in, estimated values of air pollution from a computational model (CMAQ), and land area of the county of the monitor.
We then collect the predictions made by our tuned Random Forest model on the testing data.
test_predictions <- collect_predictions(overallfit)
We have developed a random forest model to predict the levels of air pollution. We want to create a map to visualize our predictions and compare them with the actual observed values.
we are creating a map to visualize data points related to air pollution monitors. We fetch a map of countries using the ne_countries function from the rnaturalearth package. The map is set at a “medium” scale, providing a balance between detail and performance. The returnclass = “sf” argument specifies that the data should be returned as an ‘sf’ object, which is suitable for spatial plotting in R.
Then, we use ggplot to start plotting our map. The geom_sf() function adds the spatial data (the map of the world) to the plot. The coord_sf() function is then used to set the map’s coordinates, focusing on a specific region. In this case, the xlim and ylim arguments are set to display the geographical area covering the United States.
We then add data points from the pm dataset onto the map. These points represent the locations of air pollution monitors. The aes(x = lon, y = lat) argument specifies that the longitude (lon) and latitude (lat) columns from the pm dataset should be used to place the points on the map. The size, shape, and color of the points (dark red) are specified to make them visible and distinct.
world <- ne_countries(scale = "medium", returnclass = "sf")
ggplot(data = world) +
geom_sf() +
coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50),
expand = FALSE)+
geom_point(data = pm, aes(x = lon, y = lat), size = 2,
shape = 23, fill = "darkred")
We then retrieve county boundary data using the maps package and convert it into a simple features (sf) object. The map function fetches data for U.S. counties (“county”), and st_as_sf from the sf package converts this data into a format suitable for plotting with ggplot2.
counties <- sf::st_as_sf(maps::map("county", plot = FALSE,
fill = TRUE))
monitors <- ggplot(data = world) +
geom_sf(data = counties, fill = NA, color = gray(.5))+
coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50),
expand = FALSE) +
geom_point(data = pm, aes(x = lon, y = lat), size = 2,
shape = 23, fill = "darkred") +
ggtitle("Monitor Locations") +
theme(axis.title.x=element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
monitors
Now, we would like to color our map based on the air pollution levels. To add a county-level fill to our map based on the monitor values, we need to merge our counties dataset with pm datsaset. We first need to ensure that the county names in our datasets are formatted consistently for accurate merging.
head(counties)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
## Geodetic CRS: +proj=longlat +ellps=clrk66 +no_defs +type=crs
## ID geom
## alabama,autauga alabama,autauga MULTIPOLYGON (((-86.50517 3...
## alabama,baldwin alabama,baldwin MULTIPOLYGON (((-87.93757 3...
## alabama,barbour alabama,barbour MULTIPOLYGON (((-85.42801 3...
## alabama,bibb alabama,bibb MULTIPOLYGON (((-87.02083 3...
## alabama,blount alabama,blount MULTIPOLYGON (((-86.9578 33...
## alabama,bullock alabama,bullock MULTIPOLYGON (((-85.66866 3...
As we cann see, the county data (counties) includes county names that follow the state names, separated by a comma, and are all in lowercase. To match this format with the county names in our air pollution data (pm), we need to ensure that both datasets have the county names formatted in the same way.
We use the separate function from the tidyr package to split the ID column in the counties dataset into two new columns: state and county. After this operation, counties will have two separate columns for state and county names. We the mutate the county column using str_to_title() to make them all lower-cased.
In the final step, we perform innerjoin between the modified counties dataset and the pm dataset. The join is based on the county column. The result, map_data, is a new dataset that combines information from both counties and pm, effectively linking the geographical data of counties with the air pollution data from pm.
counties <- counties %>%
tidyr::separate(ID, into = c("state", "county"), sep = ",") %>%
dplyr::mutate(county = stringr::str_to_title(county))
map_data <- dplyr::inner_join(counties, pm, by = "county")
## Warning in sf_column %in% names(g): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 4 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
We then create a geographical map visualization displaying true air pollution levels across various counties.
We first draw the map of Unietd States using the “world” dataset. Then we use geom_sf to add the map_data to the plot. The counties are filled based on the value column from map_data, which represents the air pollution levels. Finally, we define a gradient color scale for visualizing PM 2.5 levels, using topo.colors. The na.value = “transparent” makes areas with no data transparent.
truth <- ggplot(data = world) +
coord_sf(xlim = c(-125,-66),
ylim = c(24.5, 50),
expand = FALSE) +
geom_sf(data = map_data, aes(fill = value)) +
scale_fill_gradientn(colours = topo.colors(7),
na.value = "transparent",
breaks = c(0, 10, 20),
labels = c(0, 10, 20),
limits = c(0, 23.5),
name = "PM ug/m3") +
ggtitle("True PM 2.5 levels") +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
truth
After that, we are creating a map of our predicted air pollution levels.
We first fit the RF_tuned_wflow model to the training data (train_pm) and testing data (test_pm). This results in two fitted models: RF_final_train_fit for the training data and RF_final_test_fit for the testing data.
Then, we make predictions on both the training and testing datasets using their respective fitted models. The predicted values are then combined (bind_cols) with the original datasets to include relevant columns like value, fips, county, and id.The predictions from both the training and testing sets are combined into a single dataset all_pred.
Similar to the previous step, a map is created using ggplot, with the focus on the United States. The geom_sf function adds the map_data to the plot, filling the counties based on the predicted PM 2.5 levels (.pred).
# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)
# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_pm) %>%
bind_cols(train_pm %>% select(value, fips, county, id))
# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_pm) %>%
bind_cols(test_pm %>% select(value, fips, county, id))
values_pred_test
## # A tibble: 292 × 5
## .pred value fips county id
## <dbl> <dbl> <fct> <chr> <fct>
## 1 11.5 11.2 1033 Colbert 1033.1002
## 2 12.1 12.4 1055 Etowah 1055.001
## 3 11.0 10.5 1069 Houston 1069.0003
## 4 14.2 15.6 1073 Jefferson 1073.0023
## 5 12.2 12.4 1073 Jefferson 1073.1005
## 6 11.3 11.1 1073 Jefferson 1073.1009
## 7 11.6 11.8 1073 Jefferson 1073.5003
## 8 10.9 10.0 1097 Mobile 1097.0003
## 9 11.9 12.0 1101 Montgomery 1101.0007
## 10 12.8 13.2 1113 Russell 1113.0001
## # ℹ 282 more rows
# combine
all_pred <- bind_rows(values_pred_test, values_pred_train)
map_data <- inner_join(counties, all_pred, by = "county")
## Warning in sf_column %in% names(g): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 4 of `x` matches multiple rows in `y`.
## ℹ Row 164 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
pred <- ggplot(data = world) +
coord_sf(xlim = c(-125,-66),
ylim = c(24.5, 50),
expand = FALSE) +
geom_sf(data = map_data, aes(fill = .pred)) +
scale_fill_gradientn(colours = topo.colors(7),
na.value = "transparent",
breaks = c(0, 10, 20),
labels = c(0, 10, 20),
limits = c(0, 23.5),
name = "PM ug/m3") +
ggtitle("Predicted PM 2.5 levels") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
pred
Here, we are getting the average absolute difference between the models predicted values and the true values in order to get an idea of the models prediction accuracy
avg_abs_difference <- abs(values_pred_test[,'.pred'] - values_pred_test[,'value'])
mean(avg_abs_difference[,'.pred'])
## [1] 0.4658079
The output 0.4658079 indicates that the average absolute difference between the predicted and actual PM 2.5 levels in the test data is approximately 0.466. The smaller the number, the closer the model’s predictions are to the actual values. Therefore, our model’s predictions are quite accurate.
Finally, we are putting the maps of true air pollution values and our predicted values together.
(truth/pred) +
plot_annotation(title = "Machine Learning Methods Allow for Prediction of Air Pollution", subtitle = "A random forest model predicts true monitored levels of fine particulate matter (PM 2.5) air pollution based on\ndata about population density and other predictors reasonably well, thus suggesting that we can use similar methods to predict levels\nof pollution in places with poor monitoring",
theme = theme(plot.title = element_text(size =12, face = "bold"),
plot.subtitle = element_text(size = 8)))
Based on the map comparison provided, we can observe that the predicted air pollution levels generated by our model are quite similar to the actual monitored values. The color patterns on both maps, representing the concentration of PM 2.5, show a good degree of consistency, which suggests our model’s predictions align well with reality.
However, one noticeable difference is that the predicted map appears to have fewer instances of extreme values. The actual map shows more areas with very low (dark blue) and very high (yellow) air pollution levels. In contrast, the predicted map seems to have a more moderate range of values. This could indicate that while our model is effective in capturing general trends, it may be somewhat conservative in predicting the more extreme levels of air pollution. This could be due to various reasons, such as the model possibly smoothing over outliers or the absence of sufficient training data at the extremes.
In summary, through our exploratory data analysis we found multiple variables with strong correlations with each other and through that found that there was a way to predict the annual average air pollution concentration in the US. While exploring our data we found that California had a vast majority of the monitors in the US which we believe was due to the states much larger population and possible larger concern for the environment. On the other hand states such as Maine and North Dakota had significantly less monitors reflecting the importance of monitors in states of higher population densities.
In order to begin answering our research question, we fitted a random forest model to do such. We ran V fold cross validation to find the ideal variables to run for the prediction along with the ideal number of variables and nodes to use. After successfully training and fitting the model the test results had an average difference from the true values of 0.466 which was very accurate. Our results from the model showed that we could even use it to predict levels of pollution in areas with low monitoring.
With the data we have, there is a high chance we can accurately predict US annual average air pollution concentrations at the granularity of zip code regional levels. Taking a look at the graph of our models predictions compared to the true values, it can be seen that our model was extremely accurate in its predictions from the east all through to the west of the US. While the chances of an accurate prediction are high there is always the possibility of the model we used being biased to our data set. Even though our model in general had very good results, these could always be better with more in depth data along with better model methods or even a better choice of model being used. In conclusion, yes we can predict the US annual average air pollution concentrations at the granularity of zip code regional levels.
It has been observed that monitoring stations for PM2.5 are predominantly located in specific regions, notably in areas like California. This geographical concentration raises an intriguing question about whether the presence of air quality monitoring stations correlate with the economic prosperity of a region as measured by its Gross Domestic Product (GDP).To explore this hypothesis, datasets encompassing GDP figures for counties across the entire U.S. have been compiled and analyzed.
The core objective is to see whether there is a significant difference in the GDPs of counties that have PM2.5 monitors compared to those that do not. This comparison could bring insights into the economic development of areas prioritized for environmental monitoring and potentially improve environmental policies.
To examine whether there’s a significant difference in GDP between counties with PM2.5 air quality monitors and those without, we are importing data from the Bureau of Economic Analysis 15. We will read the dataset using read_excel function and name this dataset as countygdp. This dataset is organized by state, listed in alphabetical order starting with Alabama. The relevant data for our analysis, which includes the GDP information for each county, is found between the 6th and 3223rd rows so we will specify that by setting the range to be from 6th row to 3223rd row.
countygdp <- read_excel("lagdp1223.xlsx",
range = cell_rows(6:3223),
col_names = TRUE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
countygdp
## # A tibble: 3,217 × 11
## ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Autauga 1730861 1722438 1727818 1929264 23 -0.5 0.3 11.7 1 Alab…
## 2 Baldwin 8148786 8102009 8738819 8924207 7 -0.6 7.9 2.1 27 Alab…
## 3 Barbour 762557 731636 747888 745349 42 -4.09… 2.20… -0.3 43 Alab…
## 4 Bibb 441359 460018 458049 448464 52 4.2 -0.4 -2.1 51 Alab…
## 5 Blount 1014620 907179 1025600 1045845 34 -10.6 13.1 2 29 Alab…
## 6 Bullock 266230 268773 258126 247403 63 1 -4 -4.2 60 Alab…
## 7 Butler 638217 603301 634002 652201 43 -5.5 5.09… 2.9 19 Alab…
## 8 Calhoun 4555074 4461364 4624647 4608415 12 -2.1 3.7 -0.4 44 Alab…
## 9 Chambers 938023 880665 871486 894548 37 -6.1 -1 2.6 23 Alab…
## 10 Cherokee 575624 536093 595234 606899 45 -6.9 11 2 30 Alab…
## # ℹ 3,207 more rows
The dataset from the Bureau of Economic Analysis offers a comprehensive view of county-level economic data across the United States, covering 3217 rows and 11 columns. Each row represents a distinct county. The dataset includes the following information:
County Name (1st Column): Identifies each county.
GDP from 2019 to 2022 (2nd to 5th Columns): Provides yearly GDP data, giving insights into economic trends over four years.
GDP Rank in the State (6th Column): Shows the rank of each county’s GDP within its state.
GDP Percentage Change from 2020 to 2022 (7th to 9th Columns): Indicates the annual percentage change in GDP, offering a measure of economic growth or decline.
Rank of GDP Percentage Change in State (10th Column): Ranks the counties based on their GDP percentage change within the state.
State (11th Column): Specifies the state to which the county belongs.
For our analysis, we’ll focus on the most recent GDP data (2022), along with the county names and their respective states. We will use the select function to select the first, fifth and eleventh columns. We will rename these columns: county, gdp and state.
countygdp <- countygdp |>
select(1, 5, 11)
colnames(countygdp) <- c("county", "gdp", "state")
Given the previous findings that Alaska and Hawaii do not have PM2.5 air quality monitors, it is logical to exclude counties from these two states from our analysis. The following code is filtering out data from the countygdp dataset to exclude counties located in the states of Hawaii and Alaska.
countygdp <- countygdp |>
filter(state != "Hawaii" & state != "Alaska")
countygdp
## # A tibble: 3,046 × 3
## county gdp state
## <chr> <chr> <chr>
## 1 Autauga 1929264 Alabama
## 2 Baldwin 8924207 Alabama
## 3 Barbour 745349 Alabama
## 4 Bibb 448464 Alabama
## 5 Blount 1045845 Alabama
## 6 Bullock 247403 Alabama
## 7 Butler 652201 Alabama
## 8 Calhoun 4608415 Alabama
## 9 Chambers 894548 Alabama
## 10 Cherokee 606899 Alabama
## # ℹ 3,036 more rows
To check the number of distinct states in the countygdp dataset and compare it with the pm dataset, we use the distinct() function from the dplyr package.
countygdp |>
distinct(state)
## # A tibble: 49 × 1
## state
## <chr>
## 1 Alabama
## 2 Arizona
## 3 Arkansas
## 4 California
## 5 Colorado
## 6 Connecticut
## 7 Delaware
## 8 District of Columbia
## 9 Florida
## 10 Georgia
## # ℹ 39 more rows
After that we see there are 49 states just like our pm dataset. District of Columbia is counted as a separated state. Hawaii and Alaska are not included.
We are trying to combine two datasets into one, named monitor_county, containing details about counties with air quality monitors and their GDPs. We begin by unifying the format of county and state names in both datasets to lowercase using tolower() function, ensuring consistency for merging. Following this, the pm dataset was pared down to just the county and state columns using the select() function.
Then, a left_join function merged the pm dataset with countygdp, aligning on county and state names. This merge created the monitor_county dataset, which combined monitor information with GDP data. Conversely, the anti_join function was used to form the other_county dataset, encompassing counties in countygdp not found in pm. This effectively distinguished counties without air quality monitors.
The left_join function here was key to preserving all entries from the pm dataset, adding GDP information from countygdp where available. In contrast, anti_join identified and isolated counties unique to countygdp, aiding in creating a comprehensive analysis of counties with and without air quality monitoring.
# Converting to lowercase for both dataset
pm$county <- tolower(pm$county)
pm$state <- tolower(pm$state)
countygdp$county <- tolower(countygdp$county)
countygdp$state <- tolower(countygdp$state)
monitor_county <- pm |>
select(county, state)|>
left_join(countygdp, by = c("county", "state"))
other_county <- anti_join(countygdp, monitor_county, by = c("county", "state"))
Then we create a dataset named missing_gdp by filtering the monitor_county dataset to isolate rows where the GDP data is missing. This is accomplished using the filter function, which selects only those rows where the gdp column has NA (Not Available) values. The resulting missing_gdp dataset thus contains records from monitor_county for which the GDP information is absent, allowing for further investigation or handling of these missing data points.
missing_gdp <- monitor_county |>
filter(is.na(gdp))
missing_gdp
## # A tibble: 21 × 3
## county state gdp
## <chr> <chr> <chr>
## 1 saint clair illinois <NA>
## 2 saint clair illinois <NA>
## 3 baltimore (city) maryland <NA>
## 4 baltimore (city) maryland <NA>
## 5 baltimore (city) maryland <NA>
## 6 baltimore (city) maryland <NA>
## 7 baltimore (city) maryland <NA>
## 8 saint louis missouri <NA>
## 9 albemarle virginia <NA>
## 10 fairfax virginia <NA>
## # ℹ 11 more rows
The missing_gdp dataset revealed naming inconsistencies between the pm and countygdp datasets, causing missing GDP data. To fix this, county names in countygdp were standardized (e.g., changing “st. clair” to “saint clair” in Illinois). To address this, a mutate operation was applied to the countygdp dataset using case_when to standardize county names. After renaming, the pm dataset, narrowed down to county and state, was merged again with countygdp to create an updated monitor_county. Counties in countygdp not in pm were placed in other_county through an anti-join. A new missing_gdp dataset was then generated to identify any remaining missing GDP entries in monitor_county.
countygdp <- countygdp %>%
mutate(county = case_when(
county == "st. clair" & state == "illinois" ~ "saint clair",
county == "st. louis" & state == "missouri" ~ "saint louis",
TRUE ~ county
))
monitor_county <- pm |>
select(county, state)|>
left_join(countygdp, by = c("county", "state"))
other_county <- anti_join(countygdp, monitor_county, by = c("county", "state"))
missing_gdp <- monitor_county |>
filter(is.na(gdp))
missing_gdp
## # A tibble: 18 × 3
## county state gdp
## <chr> <chr> <chr>
## 1 baltimore (city) maryland <NA>
## 2 baltimore (city) maryland <NA>
## 3 baltimore (city) maryland <NA>
## 4 baltimore (city) maryland <NA>
## 5 baltimore (city) maryland <NA>
## 6 albemarle virginia <NA>
## 7 fairfax virginia <NA>
## 8 fairfax virginia <NA>
## 9 fairfax virginia <NA>
## 10 frederick virginia <NA>
## 11 rockingham virginia <NA>
## 12 bristol city virginia <NA>
## 13 hampton city virginia <NA>
## 14 lynchburg city virginia <NA>
## 15 norfolk city virginia <NA>
## 16 roanoke city virginia <NA>
## 17 roanoke city virginia <NA>
## 18 virginia beach city virginia <NA>
For the remaining entries in the missing_gdp dataset where GDP data could not be located, it is likely that this issue arises from the presence of monitors in independent cities not recognized as part of county jurisdictions. An example is Fairfax in Virginia, which is an independent city and not categorized under any county. Such independent cities may not be represented in the countygdp dataset that primarily focuses on county-level GDP data. As a result, these specific locations will continue to have missing GDP values in the monitor_county dataset.
Prior to conducting exploratory data analysis, the dataset monitor_county was refined to create monitor_county_clean. This refinement involved filtering out rows with missing GDP data, ensuring that the analysis would be based only on counties with complete economic information. Additionally, the GDP values in both monitor_county_clean and other_county datasets were converted to numeric types using as.numeric method. This conversion is essential for facilitating accurate statistical analysis, as it enables the application of various quantitative methods that require numerical data input.
monitor_county_clean <- monitor_county %>%
filter(!is.na(gdp))
monitor_county_clean$gdp <- as.numeric(monitor_county_clean$gdp)
other_county$gdp <- as.numeric(other_county$gdp)
The code summary provides descriptive statistics for the GDP values in the monitor_county_clean and other_county datasets. These summaries include the minimum, first quartile, median, mean, third quartile, and maximum GDP values for each dataset.
summary(monitor_county_clean$gdp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78859 4863260 17114492 56271268 52659909 790016435
summary(other_county$gdp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11446 319612 743910 1908299 1762218 85468316
These statistics show a wide range of GDP values, especially in monitor_county_clean, indicated by a high maximum value. The differences between the median and mean values suggest a skewed distribution, with some counties having significantly higher GDPs. The log transformation of the GDP data is needed due to the large range and skewness shown by the difference between the mean and median values. This transformation helps in three ways: it makes the data distribution more normal and symmetrical, which is important for many statistical analyses; it lessens the effect of very high or low values that can distort the average and spread of the data; and it reduces the size of very large GDP values, making it easier to compare and visualize the data across counties.
Therefore, we applied a logarithmic transformation to the GDP values in both the monitor_county_clean and other_county datasets. We store the log gdp in a new column named log_gdp. Additionally, to clearly distinguish between the two data sources, we introduced a new column labeled source in each dataset. monitor_county_clean was tagged as ‘Monitor County’, and `other_county’ as ‘Other County’. This labeling is key for identifying the origins of the data once combined. We then merged these datasets into a single dataset named combined_data using the rbind function. This unified dataset includes both groups of counties, those with and without air quality monitors, along with their respective log-transformed GDP values. We proceeded to create a boxplot from this combined dataset using ggplot package, visually comparing the log-transformed GDP distributions of counties with monitors against those without.
# Adding log-transformed GDP
monitor_county_clean$log_gdp <- log(monitor_county_clean$gdp)
other_county$log_gdp <- log(other_county$gdp)
monitor_county_clean$source <- 'Monitor County'
other_county$source <- 'Other County'
# Combine the data again
combined_data <- rbind(monitor_county_clean, other_county)
# Create the boxplot with log-transformed GDP
ggplot(combined_data, aes(x = source, y = log_gdp, fill = source)) +
geom_boxplot() +
labs(title = "Log-transformed GDP of counties with monitors and counties without monitors",
subtitle = "GDP of counties with monitors is significantly higher",
x = "County Type",
y = "Log of GDP") +
theme_minimal() +
theme(legend.position = "none")
From the plot, it’s apparent that counties with monitors tend to have higher log GDP values — indicated by a higher median (the line within the box), as well as a larger interquartile range (the height of the box itself). The boxplot displays a noticeable number of outliers for counties without monitors (Other County), as indicated by the black dots beyond the whiskers of the boxplot. These outliers represent counties with GDP values that deviate significantly from the typical GDP range of other counties in the same category. The presence of these outliers suggests that while the overall GDP for counties without monitors may be lower on average, there are still some counties in this group with GDPs that are much higher than the median, potentially comparable to those of counties with monitors.
For our statistical analysis, we intend to compare the log-transformed GDP values of counties with and without air quality monitors to determine if the differences observed are statistically significant. Typically, a t-test is employed for such comparisons, as it assesses whether the means of two groups are statistically distinct from each other. However, the t-test has key assumptions that must be satisfied for accurate application: the data distributions should be approximately normal, and the variances between the two groups should be similar.
To validate these assumptions, we use histograms and Q-Q plots for the monitor_county_clean and other_county datasets. We use ggplot to generate plot and specify we want to generate histogram by using geom_histogram. This helps visualize the distribution of the log-transformed GDP values, with a bin width set to 1 and distinct colors for each group for clear differentiation.
Following the histograms, we creared Q-Q plot for each dataset by using geom_qq(). In these plots, the data’s quantiles are compared against the quantiles from a normal distribution. The closer the points lie to the reference line (added with geom_qq_line()), the more the data’s distribution resembles a normal distribution. 16
# Histogram for monitor_county_clean
ggplot(monitor_county_clean, aes(x = log_gdp)) +
geom_histogram(binwidth = 1, fill = 'blue', alpha = 0.7) +
labs(title = "Histogram of Log GDP for counties with monitors",
subtitle = "Log GDP for counties with monitors is approximately normally distributed",
x = "Log of GDP", y = "Frequency")
# Histogram for other_county
ggplot(other_county, aes(x = log_gdp)) +
geom_histogram(binwidth = 1, fill = 'red', alpha = 0.7) +
labs(title = "Histogram of Log GDP for counties without monitors",
subtitle = "Log GDP for counties without monitors is approximately normally distributed",
x = "Log of GDP", y = "Frequency")
# Q-Q Plot for monitor_county_clean
ggplot(monitor_county_clean, aes(sample = log_gdp)) +
geom_qq() +
geom_qq_line() +
labs(title = "Q-Q Plot for counties with monitors",
subtitle = "Log GDP for counties with monitors is approximately normally distributed")
# Q-Q Plot for other_county
ggplot(other_county, aes(sample = log_gdp)) +
geom_qq() +
geom_qq_line() +
labs(title = "Q-Q Plot for counties without monitors",
subtitle = "Log GDP for counties without monitors is approximately normally distributed")
From the histograms produced for both datasets, we observe that the distributions of log-transformed GDP values exhibit characteristics of normality. The majority of the data clusters around the center of the distribution and gradually decrease towards both ends. This bell-shaped pattern suggests that the log GDP values for both counties with and without monitors are approximately normally distributed. The Q-Q plots for counties with and without monitors show the quantiles of the log-transformed GDP data plotted against the expected quantiles of a normal distribution. The data points largely follow the reference line, suggesting that the log GDP is approximately normally distributed. However, there are deviations observable at the lower end of the distribution, indicating some skewness or outliers in the data.
Then, we decided to check if the two datasets have similar variances. To do this, we conducted Levene’s Test. The Levene test checks whether several groups have the same variance in the population. It is used to test the null hypothesis that the samples to be compared come from a population with the same variance. If the p-value for the Levene test is greater than .05, then the variances are not significantly different from each other (i.e., the homogeneity assumption of the variance is met). If the p-value for the Levene’s test is less than .05, then there is a significant difference between the variances.17
We began by converting the ‘source’ column in our combined_data dataset into a factor because Levene’s Test requires the grouping variable to be categorical. In R, categorical variables are typically handled as factors. This ensures that the statistical functions recognize the variable correctly and perform the group comparisons as intended. We then executed leveneTest using the leveneTest function is from the car package, specifying the log-transformed GDP as our test variable and ‘source’ as the grouping factor.
The output from levene_test would inform us about the equality of variances between the two groups. If the test yielded a p-value less than 0.05, it would suggest a significant difference in variances, which would mean we should use statistical approach other than t-test.
combined_data$source <- as.factor(combined_data$source)
levene_test <- leveneTest(log_gdp ~ source, data = combined_data)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 132.65 < 2.2e-16 ***
## 3344
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output from Levene’s Test indicates a significant difference in the variances between the two groups. The test gives an F value of 130.6, which is considerably high, and a p-value that is less than 2.2e-16, well below the conventional threshold of 0.05 for statistical significance.
Given the p-value is so small, we can confidently state that the variances of the log-transformed GDP are not the same for the two groups of counties. The assumption of equal variances needed for a traditional t-test is violated, suggesting that we should consider alternative analysis methods, such as Welch’s t-test. Welch’s t-test assumes that both groups of data are sampled from populations that follow a normal distribution, but it does not assume that those two populations have the same variance 18.
We used the t.test function to conduct Welch’s t-test on our combined_data dataset. We specifically employed Welch’s t-test in our analysis by setting var.equal = FALSE in the t.test function call, meaning that the datasets do not have similar variances. We used the “log_gdp ~ source” within the t.test function to compare the log-transformed GDP values across two distinct groups of counties, as specified by the ‘source’ variable previously.
# Welch's t-test
welch_test_result <- t.test(log_gdp ~ source, data = combined_data, var.equal = FALSE)
print(welch_test_result)
##
## Welch Two Sample t-test
##
## data: log_gdp by source
## t = 47.142, df = 1204.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Monitor County and group Other County is not equal to 0
## 95 percent confidence interval:
## 2.899056 3.150836
## sample estimates:
## mean in group Monitor County mean in group Other County
## 16.57966 13.55472
We found a highly significant difference in the mean log-transformed GDP between counties with and without monitors. The degrees of freedom for the test were about 1204.3. The t-test yielded a t-value of 47.041, which is substantially higher than the critical t-value of approximately 1.96 for degrees of freedom over 1000 19. This high t-value, coupled with a remarkably low p-value (less than 2.2e-16), strongly indicates that the difference in means is not due to random chance.
In summary, our analysis indicates that counties with air quality monitors in the United States tend to have higher GDP compared to counties without monitors. This conclusion is supported by statistical tests, Welch’s Two Sample t-test, which show significant differences in economic performance between the two groups of counties. The evidence suggests that the presence of air quality monitoring infrastructure is associated with more substantial economic development. This is probably because counties with robust economies may place a higher priority on environmental regulations and compliance. This insight can be valuable for policymakers and stakeholders interested in the economic implications of environmental monitoring.
https://www.who.int/teams/environment-climate-change-and-health/air-quality-and-health/health-impacts↩︎
https://www.who.int/teams/environment-climate-change-and-health/air-quality-and-health/health-impacts↩︎
https://www.cdc.gov/air/particulate_matter.html#:~:text=Coarse%20(bigger)%20particles%2C%20called,or%20even%20into%20your%20blood.↩︎
https://www.eea.europa.eu/en/topics/in-depth/air-pollution/eow-it-affects-our-health↩︎
https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63↩︎
https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63↩︎
https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63↩︎
https://library.virginia.edu/data/articles/understanding-q-q-plots↩︎
https://socialsci.libretexts.org/Bookshelves/Anthropology/Linguistics/Corpus_Linguistics%3A_A_Guide_to_the_Methodology_(Stefanowitsch)/14%3A_Statistical_Tables/14.4%3A_Critical_values_for_Welch's_(t)-Test↩︎